Big Data Analysis in Finance
Module 2
Big Data
Intro
Big data in business
“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore
“In God we trust, all others bring data.” — W. Edwards Deming
Big Data
What is big data?
- Volume: the massive amount (size) of data
- Variety (Complexity): the diversity and complexity of data
- Velocity: the speed of data generation
Big Data by Volume
A rough definition of big data, by volume:
If the data fits in memory: small data
If the data is bigger than memory but fits on a hard disk: medium data
If the data is larger than a typical hard disk: big data
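A minimal sketch of this rule of thumb, assuming the dataset size, memory size, and disk size are all known in bytes. The function name `classify_by_volume` and the 16 GB / 1 TB machine are illustrative assumptions, not part of the slides.

```python
def classify_by_volume(data_bytes: int, memory_bytes: int, disk_bytes: int) -> str:
    """Label a dataset with the rough volume rule: small, medium, or big."""
    if data_bytes <= memory_bytes:
        return "small"   # fits in memory
    if data_bytes <= disk_bytes:
        return "medium"  # too big for memory, but fits on disk
    return "big"         # does not fit on a single disk

GB = 1024 ** 3
# Hypothetical laptop: 16 GB of RAM and a 1 TB disk.
memory, disk = 16 * GB, 1024 * GB

for size in (2 * GB, 200 * GB, 5000 * GB):
    print(size // GB, "GB ->", classify_by_volume(size, memory, disk))
```

The labels depend on the machine: the same 200 GB file is "medium" on this laptop but could be "small" on a large-memory server.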
How is Big Data Managed?
(Massively) Large datasets are stored in different ways.
Database
Data Warehouse
Data Lake
Database
A well-organized file cabinet
Structured
Usually managed by a single provider
Many databases accept SQL
e.g., Mergent, MSRB, IvyDB
Data Warehouse
- A huge library archive
- Data from various sources
- e.g., photos, videos, voice recordings, documents (databases)
- (Semi-)structured
- Optimized for looking up, not for quick updates
- e.g., WRDS, Google BigQuery, Amazon RedShift, Snowflake
Data Lake
A data pool, or a dumping site
Unstructured, not organized
- Can be structured / semi-structured within
Data from various sources, can be raw
A great way to store massive amounts of data quickly
e.g., AWS S3, Google Cloud Storage, Git LFS
Views on Big Data Approach
If your data fits in memory, there’s no advantage to putting it in a database: it will only be slower and more frustrating.
- Hadley Wickham, Chief Scientist at Posit
- Some backend engines challenge this view (e.g., duckplyr)
Challenges in Big Data
Often we don’t have enough computing power to handle big data:
The data is too big to fit in memory
You can’t use the same toolbox in R/Python
There are many choices to consider
How do we handle Big Data?
Two approaches:
- Shrink the data
- Scale the machine
Common Big Data Solutions
Big data problems are often described as “small data problems in disguise”, meaning:
Often what we care about is a subset of the large data.
When data is stored in a well-structured way (e.g., in a database),
we can avoid loading the whole dataset into memory
and read only what is relevant.
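As a rough illustration of reading only what is relevant, the sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for a large remote one. The `trades` table, its columns, and all values are made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for a large remote database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (ticker TEXT, trade_date TEXT, price REAL, volume INTEGER)"
)
rows = [
    ("AAA", "2024-01-02", 10.0, 100),
    ("AAA", "2024-01-03", 11.0, 300),
    ("BBB", "2024-01-02", 50.0, 200),
    ("BBB", "2024-01-03", 52.0, 400),
    ("CCC", "2024-01-02", 5.0, 900),
]
conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", rows)

# Only the rows for one ticker are returned; the others never reach Python.
subset = conn.execute(
    "SELECT trade_date, price FROM trades WHERE ticker = ? ORDER BY trade_date",
    ("AAA",),
).fetchall()

# The aggregation runs inside the database engine; one row per ticker comes back.
summary = conn.execute(
    "SELECT ticker, SUM(volume) AS total_volume "
    "FROM trades GROUP BY ticker ORDER BY ticker"
).fetchall()

print(subset)
print(summary)
conn.close()
```

With a real server the pattern is the same: the `WHERE` and `GROUP BY` clauses decide how much data travels to your machine.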
Approach 1: Shrink the data
Database querying
- Do the heavy lifting on the server
- Or out-of-core (on disk)
- Retrieve only what you need
Chunk processing
- Read, process, write little by little
Downsampling
- Random sampling
- Temporal sampling (e.g., keep one observation per period)
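Chunk processing and random downsampling can be sketched with the standard library alone. The example below assumes a CSV with a single hypothetical `price` column; an in-memory string stands in for a file too large to load at once, and the helper `read_in_chunks` is an illustrative name rather than a library function.

```python
import csv
import io
import random
from itertools import islice

# Toy CSV standing in for a very large file (values 1..100 are made up).
raw = "price\n" + "\n".join(str(p) for p in range(1, 101))

def read_in_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows from a CSV file handle."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Chunk processing: keep only running totals in memory, never the full file.
total, count = 0.0, 0
for chunk in read_in_chunks(io.StringIO(raw), chunk_size=30):
    total += sum(float(row["price"]) for row in chunk)
    count += len(chunk)
mean_price = total / count

# Random downsampling: keep each row with probability `keep`, one row at a time.
rng = random.Random(0)  # fixed seed so the sample is reproducible
keep = 0.1
sample = [row for row in csv.DictReader(io.StringIO(raw)) if rng.random() < keep]

print(mean_price, count, len(sample))
```

Both loops hold at most one chunk (or one row) in memory at a time, so the same code works when the string is replaced by an open file handle.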
Approach 2: Scaling up
Cloud computing (Virtual Machines / Containers)
High performance clusters (HPC)
Cloud Computing
- On-Demand: Resources are provisioned dynamically, as needed
- Scalability: Easily scale up or down
- Flexibility: Ideal for diverse applications
- Managed Services: Provided and maintained by commercial vendors
- Pricing:
- Pay-as-you-go
- Economical for short runs, expensive for long-running jobs
High Performance Computing (HPC)
- Dedicated: Resources reserved for a job’s duration
- Optimized for Performance: Designed for tightly coupled, parallel processing tasks
- Research-Centric: For intensive simulations and analyses
- Pricing:
- Fixed upfront / subscriptions
- Often supported by academic or research institutions
Cloud Computing
Well-known cloud computing providers:
- AWS (Amazon Web Services)
- Microsoft Azure
- Google Cloud Platform (GCP)
- Oracle Cloud
HPC at USF
CIRCE
Managed by the Research Computing department
Two main clusters: CIRCE and Secure Cluster for sensitive data
Access through JIRA or email request
Database
A database is a collection of data that is structured and organized.
- Databases are often provided remotely but can also be stored locally.
Database in a nutshell:
A filing cabinet arranges items in alphabetical order:
Files starting with A, B, or C go in the top drawer, D, E, or F in the second drawer, etc.
To find Alice’s file, you only have to search the top drawer.
For Fred, the second drawer, and so on.
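In database terms, the labelled drawers correspond roughly to an index. The sketch below uses SQLite through Python's `sqlite3` module to show the engine looking a row up through an index rather than scanning every row; the `clients` table, its values, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (name TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [("Alice", 120.0), ("Bob", 75.5), ("Fred", 310.0), ("Zoe", 42.0)],
)

# The index plays the role of the labelled drawers: names kept in sorted order.
conn.execute("CREATE INDEX idx_clients_name ON clients (name)")

# EXPLAIN QUERY PLAN reports how SQLite intends to locate the row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT balance FROM clients WHERE name = ?", ("Fred",)
).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)  # last column is the description

fred = conn.execute(
    "SELECT balance FROM clients WHERE name = ?", ("Fred",)
).fetchone()

print(plan_text)
print(fred)
conn.close()
```

The exact wording of the plan differs between SQLite versions, but it should mention `idx_clients_name`, which indicates a lookup through the index instead of a full table scan.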
Types of database
Relational database (RDBMS)
Traditional, often called SQL databases
Data stored in tabular form
Optimal for data whose structure does not change often
Preferred when accuracy and consistency are crucial (e.g., financial data)
e.g., PostgreSQL, MySQL
Non-relational database (NoSQL)
Often called NoSQL databases
Data stored in formatted text form (e.g., JSON)
For organizing complex, diverse, and changing data
e.g., MongoDB
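To make the contrast concrete, the sketch below stores one made-up bond record twice: as a row in a relational table (via `sqlite3`) and as a JSON document of the kind a document store would hold (via the `json` module). The field names and values are illustrative assumptions, and no actual NoSQL database is involved.

```python
import json
import sqlite3

# Relational form: every row has the same columns, fixed in advance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bonds (bond_id TEXT PRIMARY KEY, coupon REAL, maturity TEXT)"
)
conn.execute("INSERT INTO bonds VALUES ('B001', 4.5, '2030-06-15')")
row = conn.execute("SELECT bond_id, coupon, maturity FROM bonds").fetchone()
conn.close()

# Document form: each record describes itself and may carry nested or extra fields.
document = {
    "bond_id": "B001",
    "coupon": 4.5,
    "maturity": "2030-06-15",
    "ratings": [{"agency": "Agency X", "grade": "AA"}],  # nested list, no fixed schema
}
encoded = json.dumps(document)  # the text form that would be stored
decoded = json.loads(encoded)   # reading it back gives the same structure

print(row)
print(decoded["ratings"][0]["grade"])
```

Adding the nested `ratings` field to the relational version would need a schema change or a second table, whereas the document simply gains a key.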
Relational Databases (RDBMS)
Mergent FISD Example
Mergent FISD (Fixed Income Securities Database)
Issue table
Issuer table
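A toy version of the issue/issuer layout can show how two relational tables are linked by a shared key. The column names and rows below are hypothetical and heavily simplified; they are not the actual Mergent FISD schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified columns; the real tables contain many more fields.
conn.execute("CREATE TABLE issuer (issuer_id INTEGER PRIMARY KEY, issuer_name TEXT)")
conn.execute(
    "CREATE TABLE issue ("
    "issue_id INTEGER PRIMARY KEY, "
    "issuer_id INTEGER REFERENCES issuer (issuer_id), "
    "coupon REAL, maturity TEXT)"
)
conn.executemany(
    "INSERT INTO issuer VALUES (?, ?)", [(1, "Company A"), (2, "Company B")]
)
conn.executemany(
    "INSERT INTO issue VALUES (?, ?, ?, ?)",
    [
        (10, 1, 3.0, "2028-01-15"),
        (11, 1, 4.25, "2033-07-01"),
        (12, 2, 5.5, "2030-03-31"),
    ],
)

# Each issue row stores only issuer_id; the join looks up the issuer's name.
bonds = conn.execute(
    "SELECT i.issue_id, s.issuer_name, i.coupon "
    "FROM issue AS i JOIN issuer AS s ON i.issuer_id = s.issuer_id "
    "ORDER BY i.issue_id"
).fetchall()

print(bonds)
conn.close()
```

Keeping issuer details in their own table means a company's name is stored once, however many bonds it has issued.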
DBMS computational forms
Client-server DBMS: runs on a server within an organization
- WRDS
Cloud DBMS: Similar to a client-server DBMS, but hosted in the cloud
- Snowflake, Google BigQuery, Amazon RedShift
In-process DBMS: runs entirely on your computer
- DuckDB, SQLite
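A short sketch of what "in-process" means in practice, using SQLite through Python's built-in `sqlite3` module: the whole database is an ordinary file that the program opens directly, with no server to start or connect to. The file name `example.db` and the `prices` table are assumptions for the example.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.db")  # hypothetical file name

    # First session: create the database file and write to it.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE prices (ticker TEXT, price REAL)")
    conn.execute("INSERT INTO prices VALUES ('AAA', 10.0)")
    conn.commit()
    conn.close()

    # Second session: reopen the same file; no server process is involved.
    conn = sqlite3.connect(path)
    stored = conn.execute("SELECT ticker, price FROM prices").fetchall()
    conn.close()

print(stored)
```

A client-server or cloud DBMS would instead need a host name, credentials, and a network connection before the first query could run.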
Suggested Reading
- Wickham, Çetinkaya-Rundel, and Grolemund, “R for Data Science”, 2nd ed.
- Ch. 22. Databases